Suppose we construct two new variables \(Y^*\) and \(X^*\):
\[Y^* = 10 \cdot Y + 2\] \[X^* = 0.1 \cdot X\]
Suppose our sample data remain the same, but we now regress \(Y^*\) on \(X^*\) and obtain the following prediction equation: \[\hat Y^* = a^* + b^*X^*\]
How will the new intercept (\(a^∗\)) and slope (\(b^∗\)) relate to the original \(a\) and \(b\)? (i.e., I’m asking you to express \(a^∗\) and \(b^∗\) as functions of \(a\) and \(b\).)
We can invert the definitions above to express \(X\) and \(Y\) in terms of \(X^*\) and \(Y^*\):
\[Y = \frac{Y^* -2}{10} \] \[X = 10\cdot X^*\] Then \(Y = a + bX\) becomes:
\[\frac{Y^* -2}{10} = a + b\cdot 10\cdot X^*\] Arrange this to the form of \(\hat Y^* = a^* + b^*X^*\):
\[Y^* -2 = 10 \cdot (a + b\cdot 10\cdot X^*)\] \[Y^* = 10 \cdot (a + b\cdot 10\cdot X^*) + 2\] \[Y^* = 10a + 100b\cdot X^* + 2\] Now, gather the intercept part of the equation, i.e., the terms that do not depend on \(X^*\): \(10a\) and \(+2\). Then, gather the slope part, i.e., the term that does depend on \(X^*\): \(100b\cdot X^*\):
\[Y^* = (10a + 2) + 100 b\cdot X^*\] The above equation has the form \(\hat Y^* = a^* + b^*X^*\), where the \(a^*\) part is \((10a + 2)\) and the \(b^*\) part is \(100b\).
Alternatively, we can start from the equation \(\hat Y^* = a^* + b^*X^*\). Given that \(Y^* = 10Y + 2\) and \(X^* = 0.1X\), we can rewrite it as:
\[10Y + 2 = a^* + b^*\cdot 0.1X\] \[10Y = a^* + 0.1 b^*\cdot X - 2\] \[Y = \frac{a^* + 0.1 b^*\cdot X - 2}{10}\] \[Y = \frac{a^*}{10} + 0.01 b^*\cdot X - 0.2\]
Again, gather the intercept and the slope terms:
\[Y = (\frac{a^*}{10} - 0.2) + 0.01b^*\cdot X \] Since we know \(X\) and \(Y\) are related by \(Y = a + bX\), we now have:
\[a = (\frac{a^*}{10} - 0.2)\] \[b = 0.01\cdot b^*\] Then solve the two equations above for \(a^*\) and \(b^*\) in terms of \(a\) and \(b\):
\[\frac{a^*}{10} - 0.2 =a\] \[\frac{a^*}{10} =a + 0.2\] \[a^* =10a + 2\] And for \(b^*\):
\[b^* = 100b\]
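Both derivations give the same answer: \(a^* = 10a + 2\) and \(b^* = 100b\). As a sanity check, here is a minimal R sketch that confirms the relationship numerically on simulated data (the seed, sample size, and true coefficients are arbitrary choices, not from the exercise):

```r
# Simulate some data (arbitrary values, for illustration only)
set.seed(1)
X <- rnorm(100, mean = 10, sd = 2)
Y <- 3 + 0.5 * X + rnorm(100)

# Original regression: a = orig[1], b = orig[2]
orig <- coef(lm(Y ~ X))

# Transformed variables and transformed regression
Y_star <- 10 * Y + 2
X_star <- 0.1 * X
trans <- coef(lm(Y_star ~ X_star))  # a* = trans[1], b* = trans[2]

# a* should equal 10a + 2, and b* should equal 100b
all.equal(unname(trans[1]), 10 * unname(orig[1]) + 2)  # TRUE
all.equal(unname(trans[2]), 100 * unname(orig[2]))     # TRUE
```

Because the transformation is exactly linear, the equalities hold for any simulated sample (up to floating-point tolerance).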
Now we continue last week’s topic of visualizing the results of a regression model. Today we will learn two more plots: the coefficient plot and the predicted effects plot.
First, load packages to your environment.
knitr::opts_chunk$set(echo = TRUE)
library(pacman)
p_load(tidyverse, stargazer, kableExtra, coefplot)
Second, load the data and the model we ran last week.
# Load cleaned and recoded df
load("data/earnings_df.RData")
# Examine data
head(earnings_df, 10) %>% kbl("html") %>% kable_classic_2(full_width = F)
| earn | edu | age_recode | female | black | other |
|---|---|---|---|---|---|
| 14.67010 | 3 | 28 | 1 | 0 | 1 |
| 71.91367 | 7 | 53 | 0 | 0 | 0 |
| 54.62454 | 7 | 41 | 0 | 0 | 0 |
| 64.54969 | 7 | 49 | 0 | 0 | 0 |
| 31.36042 | 9 | 41 | 1 | 0 | 0 |
| 56.65916 | 7 | 63 | 0 | 0 | 0 |
| 26.21015 | 1 | 60 | 0 | 1 | 0 |
| 52.45393 | 4 | 55 | 0 | 0 | 0 |
| 37.90477 | 7 | 23 | 1 | 1 | 0 |
| 46.28573 | 6 | 49 | 1 | 0 | 1 |
# Estimate Nested Models
m1 <- lm(earn ~ age_recode,
data = earnings_df)
m2 <- lm(earn ~ age_recode + edu,
data = earnings_df)
m3 <- lm(earn ~ age_recode + edu + female,
data = earnings_df)
m4 <- lm(earn ~ age_recode + edu + female + black + other,
data = earnings_df)
m5 <- lm(earn ~ age_recode + edu + female + black + other + edu*female,
data = earnings_df)
# Examine models
stargazer(m1, m2, m3, m4, m5, type="text", omit.stat=c("ser", "f"))
##
## ================================================================
## Dependent variable:
## ---------------------------------------------------
## earn
## (1) (2) (3) (4) (5)
## ----------------------------------------------------------------
## age_recode 0.134*** 0.132*** 0.160*** 0.158*** 0.156***
## (0.042) (0.033) (0.022) (0.022) (0.019)
##
## edu 4.314*** 4.500*** 4.477*** 6.083***
## (0.171) (0.112) (0.112) (0.143)
##
## female -20.528*** -20.572*** -1.571
## (0.568) (0.565) (1.311)
##
## black -2.307*** -2.385***
## (0.623) (0.557)
##
## other -0.767 -0.946
## (1.137) (1.017)
##
## edu:female -3.128***
## (0.199)
##
## Constant 43.888*** 17.786*** 25.439*** 26.429*** 16.974***
## (1.917) (1.817) (1.207) (1.230) (1.254)
##
## ----------------------------------------------------------------
## Observations 980 980 980 980 980
## R2 0.010 0.400 0.744 0.747 0.798
## Adjusted R2 0.009 0.399 0.743 0.746 0.797
## ================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
A coefficient plot visualizes the point estimates of the coefficients along with their confidence intervals. You can create one easily using coefplot() from the coefplot package. Other packages offer similar coefficient visualizations.
# Default coefficient plot
coefplot(m5)
# Remove the intercept from the plot
coefplot(m5, intercept = F)
# The default innerCI is 1, which is 1 se around the point estimate
# the default outerCI is 2, which is 2 se around the point estimate
# You can set both to 1.96, which is the 95% confidence interval of betas
coefplot(m5, intercept = F, innerCI = 1.96, outerCI = 1.96)
# Or only keep the outerCI = 1.96
coefplot(m5, intercept = T, innerCI = F, outerCI = 1.96)
# You can also change the color, shape, and size of the texts
# as well as change plot titles and axes labels
# read the documentation for more
coefplot(m5, intercept = F, innerCI = F, outerCI = 1.96,
color = "black", # customize color
title = "Coefficient Plot for Model 5") # customize title
We can visualize the predicted effect of key predictors using the predict() function in base R.
The idea behind this task is to first create a dataframe containing all the predictors included in the model, where only your predictor(s) of interest vary and all other predictors are held at their means.
For example, if we want to examine the effect of education and gender on earnings, we create a dataframe with a variable edu that varies from 0 to 15 with an interval of 1 (so edu = 0, 1, 2, …, 14, 15).
We repeat this number sequence a second time so that we have each level of education for both males and females. So we need to generate edu = 0, 1, 2, …, 14, 15, 0, 1, 2, …, 14, 15. We use rep(0:15, 2) to generate this number sequence.
rep(x, times) replicates x (a vector or list) a user-defined number of times (in our case, times = 2). You can run this in your R console to see what number sequence is returned.
Then, we generate a dummy variable female that equals 0 for male and 1 for female.
To create a dataframe that has every combination of education level and gender category, we set female = 0 for the first 16 rows and female = 1 for the next 16 rows, using c(rep(0, 16), rep(1, 16)). You can run this in your R console to see what number sequence is returned.
For the rest of the predictors, we fix them at their means. We add na.rm = T inside mean() to specify how NA values are handled: if you don’t include na.rm = T, mean() will return NA whenever the variable contains NAs.
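The building blocks above are easy to try in your console. This sketch uses shorter vectors than the ones in the lab so the output is easy to read:

```r
# rep(x, times) replicates a vector end to end:
rep(0:3, 2)                      # 0 1 2 3 0 1 2 3

# c() concatenates two rep() calls, pairing a block of 0s with a block of 1s:
c(rep(0, 3), rep(1, 3))          # 0 0 0 1 1 1

# mean() returns NA if the vector contains NAs, unless na.rm = TRUE:
mean(c(1, 2, NA))                # NA
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5
```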
# First, we create a dataframe with all predictor variables,
# with only the key IV varying
pred_IV <- tibble(edu = rep(0:15, 2)) %>% # first, create a df with values of your key IV
  mutate(female = c(rep(0, 16), rep(1, 16)), # b/c we are looking at interaction effects,
         # give gender two values; otherwise fix it at its mean
         age_recode = mean(earnings_df$age_recode, na.rm = T), # fix other variables at their means
         black = mean(earnings_df$black),
         other = mean(earnings_df$other))
# Examine the df
head(pred_IV, 20) %>% kbl("html") %>% kable_classic_2(full_width = F)
| edu | female | age_recode | black | other |
|---|---|---|---|---|
| 0 | 0 | 43.26429 | 0.306 | 0.068 |
| 1 | 0 | 43.26429 | 0.306 | 0.068 |
| 2 | 0 | 43.26429 | 0.306 | 0.068 |
| 3 | 0 | 43.26429 | 0.306 | 0.068 |
| 4 | 0 | 43.26429 | 0.306 | 0.068 |
| 5 | 0 | 43.26429 | 0.306 | 0.068 |
| 6 | 0 | 43.26429 | 0.306 | 0.068 |
| 7 | 0 | 43.26429 | 0.306 | 0.068 |
| 8 | 0 | 43.26429 | 0.306 | 0.068 |
| 9 | 0 | 43.26429 | 0.306 | 0.068 |
| 10 | 0 | 43.26429 | 0.306 | 0.068 |
| 11 | 0 | 43.26429 | 0.306 | 0.068 |
| 12 | 0 | 43.26429 | 0.306 | 0.068 |
| 13 | 0 | 43.26429 | 0.306 | 0.068 |
| 14 | 0 | 43.26429 | 0.306 | 0.068 |
| 15 | 0 | 43.26429 | 0.306 | 0.068 |
| 0 | 1 | 43.26429 | 0.306 | 0.068 |
| 1 | 1 | 43.26429 | 0.306 | 0.068 |
| 2 | 1 | 43.26429 | 0.306 | 0.068 |
| 3 | 1 | 43.26429 | 0.306 | 0.068 |
Now that we have the dataframe pred_IV ready for predicting the dependent variable (earnings), we can use the R function predict() to calculate fitted earnings using the regression model and the values specified in each row of pred_IV. Then, use cbind() to combine this fitted-value vector with pred_IV for plotting.
# use `predict` to predict the Y
predicted_earning <- predict(m5, # the model you are using
pred_IV, # the df you use for predicting
interval = "confidence", # set CI
level = 0.95)
# bind the columns
pred_result <- cbind(pred_IV, predicted_earning)
# check df
head(pred_result, 20) %>% kbl("html") %>% kable_classic_2(full_width = F)
| edu | female | age_recode | black | other | fit | lwr | upr |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 43.26429 | 0.306 | 0.068 | 22.93127 | 21.12317 | 24.73936 |
| 1 | 0 | 43.26429 | 0.306 | 0.068 | 29.01392 | 27.46140 | 30.56644 |
| 2 | 0 | 43.26429 | 0.306 | 0.068 | 35.09657 | 33.78940 | 36.40374 |
| 3 | 0 | 43.26429 | 0.306 | 0.068 | 41.17922 | 40.10017 | 42.25827 |
| 4 | 0 | 43.26429 | 0.306 | 0.068 | 47.26188 | 46.38025 | 48.14350 |
| 5 | 0 | 43.26429 | 0.306 | 0.068 | 53.34453 | 52.60462 | 54.08443 |
| 6 | 0 | 43.26429 | 0.306 | 0.068 | 59.42718 | 58.73804 | 60.11632 |
| 7 | 0 | 43.26429 | 0.306 | 0.068 | 65.50983 | 64.76173 | 66.25793 |
| 8 | 0 | 43.26429 | 0.306 | 0.068 | 71.59248 | 70.69714 | 72.48783 |
| 9 | 0 | 43.26429 | 0.306 | 0.068 | 77.67514 | 76.57927 | 78.77100 |
| 10 | 0 | 43.26429 | 0.306 | 0.068 | 83.75779 | 82.43210 | 85.08348 |
| 11 | 0 | 43.26429 | 0.306 | 0.068 | 89.84044 | 88.26842 | 91.41246 |
| 12 | 0 | 43.26429 | 0.306 | 0.068 | 95.92309 | 94.09489 | 97.75130 |
| 13 | 0 | 43.26429 | 0.306 | 0.068 | 102.00575 | 99.91513 | 104.09636 |
| 14 | 0 | 43.26429 | 0.306 | 0.068 | 108.08840 | 105.73122 | 110.44558 |
| 15 | 0 | 43.26429 | 0.306 | 0.068 | 114.17105 | 111.54442 | 116.79768 |
| 0 | 1 | 43.26429 | 0.306 | 0.068 | 21.36018 | 19.52665 | 23.19371 |
| 1 | 1 | 43.26429 | 0.306 | 0.068 | 24.31464 | 22.72939 | 25.89990 |
| 2 | 1 | 43.26429 | 0.306 | 0.068 | 27.26911 | 25.92251 | 28.61571 |
| 3 | 1 | 43.26429 | 0.306 | 0.068 | 30.22357 | 29.09985 | 31.34729 |
# Plot
pred_result %>%
mutate(gender = ifelse(female == 0, "Male", "Female")) %>% # Convert dummy to character variable
ggplot(aes(x = edu, y = fit)) +
geom_line(aes(linetype = gender)) + # group linetype by gender
geom_ribbon(aes(ymin = lwr, ymax = upr, fill = gender), alpha = 0.3) + # add 95% CI
theme_classic() +
labs(x = "Years of Education",
y = "Predicted Earnings") +
ggtitle("Predicted Earnings by Education and Gender",
subtitle = "(Modeled with interaction between education and gender)")
Plot the effect of age on earn according to Model 5. Post your plot on Slack.
# Your code here
We can use an F-test to compare two nested regression models. The idea behind the F-test for nested models is to check how much error is reduced after adding additional predictors. A relatively large reduction in error yields a large F statistic and a small p-value. The p-value of the F statistic is the right-tail probability.
If the F-test’s p-value is significant (smaller than 0.05 in most social science studies), it means that at least one of the additional \(\beta_j\) in the full model is not equal to zero.
The F test statistic for nested regression models is calculated by:
\[F = \frac{(SSE_\text{restricted} - SSE_\text{full})/df_1}{SSE_\text{full}/df_2} \] where \(df_1\) is the number of additional predictors added in the full model and \(df_2\) is the residual df for the full model, which equals \((n - 1 - \text{number of IVs in the full model})\). The \(df\) of the F test statistic is \((df_1, df_2)\).
For example, according to the equation, we can hand-calculate the F value for m4 vs m5:
# SSE_restricted:
sse_m4 <- sum(m4$residuals^2)
# SSE_full:
sse_m5 <- sum(m5$residuals^2)
# We add one additional IV, so:
df1 = 1
# Residual df for the full model (m5):
df2 = m5$df.residual
# Calculate F:
F_stats = ((sse_m4 - sse_m5)/df1)/(sse_m5/df2)
F_stats
## [1] 246.5851
# Check tail probability using `1 - pf()`
1 - pf(F_stats, df1, df2)
## [1] 0
You can also use anova() to perform an F-test in R.
anova(m4, m5)
# F's p-value is significant: the additional IV in m5 has a non-zero effect
(You can use pencil and paper) Show that \[\frac{(SSE_\text{restricted} - SSE_\text{full})/df_1}{SSE_\text{full}/df_2} = \frac{(R^2_\text{full} - R^2_\text{restricted})/df_1}{(1 - R^2_\text{full})/df_2}\]
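Before (or after) doing the algebra, you can sanity-check the identity numerically. This sketch is a check, not a proof; it uses simulated data and two arbitrary nested models rather than earnings_df:

```r
# Simulate data and fit a restricted and a full nested model
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

restricted <- lm(y ~ x1)        # restricted model
full       <- lm(y ~ x1 + x2)   # full model

df1 <- 1                        # one extra predictor in the full model
df2 <- full$df.residual         # residual df of the full model

sse_r <- sum(residuals(restricted)^2)
sse_f <- sum(residuals(full)^2)
r2_r  <- summary(restricted)$r.squared
r2_f  <- summary(full)$r.squared

# Left-hand side (SSE form) and right-hand side (R^2 form)
F_sse <- ((sse_r - sse_f) / df1) / (sse_f / df2)
F_r2  <- ((r2_f - r2_r) / df1) / ((1 - r2_f) / df2)

all.equal(F_sse, F_r2)          # TRUE: the two formulas agree
```

The key observation for the pencil-and-paper proof is the same one that makes this check work: both models share the same total sum of squares, so \(SSE = SST(1 - R^2)\) and \(SST\) cancels from the ratio.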
Create a new variable age_sq in earnings_df that is the square of age_recode. Estimate Model 6: earn ~ age_recode + edu + female + black + other + edu*female + age_sq
Then, perform an F-test between m5 and m6. What are your null and alternative hypotheses? What is your decision based on the F-test?
# Your code here
Github hosts projects managed with the version-control system Git. Similar to the “track changes” function in Word, it allows you to track changes to your code. At the same time, Github allows you to share your code and collaborate with other programmers.
In Github, you organize projects using repositories (“repos”). A repo contains all the files of a project and their associated history. For example, I created a repo for TAing this class, named “intro_to_stats_2021”; it contains all the files I created for the class and their editing history.
To get started with Github, create a repo on your Github account.
Then, you can start coding in this synced repo folder by creating a .Rproj file and other files. Once you have made changes to this folder and want to save them to your Github editing history, open the Github Desktop application and commit and push your changes.
You can see newly added files in the “Changes” panel. You need to add a commit message each time you make changes; this helps you identify your work history.
Sign up for Github (your nyu email qualifies you for a “pro” Github account).
Download Github Desktop and sign in with your Github account.
In your Github account, create a repository named “lab6_git_demo”. You can make it either “private” or “public”.
Sync this repo to your local machine using Github Desktop (I recommend creating a designated “git_repos” folder and syncing it there).
Copy all lab6 related files to your repo folder. Then, commit and push the changes to your Github repo.
Check your “lab6_git_demo” repo page on Github: does it have all the updated files?
Delete some code of your choice in your lab6 coding file, then commit and push the change to your Github repo.
Check your “lab6_git_demo” repo page on Github and click your commit ID. What do you see?